Detecting communities in large networks 
We develop an algorithm to detect community structure in complex networks. The algorithm is
based on spectral methods and takes into account weights and link orientations. Since the method
efficiently detects clustered nodes in large networks even when these are not sharply partitioned,
it turns out to be especially suitable for the analysis of social and information networks. We test
the algorithm on a large-scale data set from a psychological experiment of word association. In this
case, it proves to be successful both in clustering words and in uncovering mental association
patterns.
Measurements and exact results concerning the clustering patterns of networks
mainly concern the occurrence of regular motifs and their correlations. However,
many social and information networks, such as the World Wide Web, turn out to be
approximately partitioned into communities of irregular shape: for example, web
pages focusing on similar topics are strongly mutually connected and have a weaker
linkage to the rest of the Web. The design of methods to partition a graph into
several meaningful, highly interconnected components has thus become a compelling
application of graph theory to biological, social and information networks.

Detecting the community structure in information networks allows one to mine
information more efficiently, narrowing the exploration of a network as large as
the World Wide Web (about 10^ nodes) to a limited portion of it. When used in the
analysis of large collaboration networks, such as companies or universities,
communities reveal the informal organization and the nature of information flows
through the whole system.

There are several empirical methods to detect communities. The most successful
algorithm, recently introduced in [10] (NG-algorithm), is based on the edge
betweenness, which measures the fraction of all shortest paths passing through a
given link or, alternatively, the probability that a random walk on the network
runs over that link. By removing links with high betweenness, one progressively
splits the whole network into disconnected components, until the network is
decomposed into communities consisting of one single node. The outcome of the
algorithm is represented by a dendrogram, i.e. a tree-like diagram where each
branching corresponds to a splitting event. Though this method has been shown to be
very powerful in cases where some a priori knowledge of the community structure is
given, it has two main disadvantages: first, it does not give an indication of the
resolution of the clustering, and thus it needs extra information as input (like
the expected number of clusters); second, its outcome is independent of how sharp
the partitioning of the graph is. In the same spirit, [14] proposed an algorithm
based on local analogues of the edge betweenness. This has the advantage of being
faster, but has the same drawbacks as the NG-algorithm.
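As a rough illustration of the betweenness-based splitting described above, the following sketch computes edge betweenness on an unweighted, undirected toy graph and removes the highest-betweenness links until the graph first splits. It is not the reference implementation of the NG-algorithm: the function names (`edge_betweenness`, `components`, `split_once`) and the six-node example graph are assumptions made for illustration, and betweenness is computed with a standard breadth-first accumulation over shortest paths.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge.

    adj: dict mapping node -> set of neighbours (undirected, unweighted).
    Returns a dict mapping frozenset({u, v}) -> betweenness.
    """
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions from the farthest nodes towards s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair of endpoints was visited from both sides.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_once(adj):
    """Remove highest-betweenness edges until the graph first splits."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:  # no edges left to remove
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the bridge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(split_once(graph))
```

On this toy graph the bridge 2-3 carries all shortest paths between the two triangles, so it is removed first and the graph splits into {0, 1, 2} and {3, 4, 5}. The sketch, like the procedure it illustrates, gives no hint about when the splitting should stop.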

An alternative way to tackle the problem, which is the one we pursue, is by
spectral analysis. Previous approaches to graph partitioning by spectral analysis
have been mostly developed in the computer science community for the purpose of
finding the best allocation of processes to processors in parallel computers, and
are based on iterative bisection. When applied to find community structures, these
methods have the disadvantage that repeated bisection is not guaranteed to reach
the best or most natural partition in general cases. Moreover, they suffer from the
same limitation as the algorithm based on the edge betweenness, since they give no
indication of when the bisection should terminate, and thus need extra information
on the expected number of communities.

Our aim in this paper is to develop a spectral-based algorithm able to reveal the
structure of a complex network, which could be blurred by the bias artificially
superimposed by the iterative bisection constraint. Such a method should be able to
combine the power of spectral analysis with the caution needed to reveal an
underlying structure when there is no clear-cut partitioning, as is often the case
in real networks.

Spectral methods are based on the analysis of the adjacency matrix $A$, whose
element $a_{ij}$ is equal to 1 if $i$ points to $j$ and 0 otherwise. In particular,
such methods analyze simple functions of $A$: the Laplacian matrix $L = K - A$ and
the Normal matrix $N = K^{-1}A$, where $K$ is the diagonal matrix with elements
$k_{ii} = \sum_{j=1}^{S} a_{ij}$ and $S$ is the number of nodes in the network. In
most approaches, referring to undirected networks, $A$ is assumed to be symmetric.
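To make these definitions concrete, the sketch below builds the Laplacian $L = K - A$ and the Normal matrix $N = K^{-1}A$ with NumPy, taking $K$ to be the diagonal matrix of row sums of $A$. The helper name `laplacian_and_normal` and the four-node adjacency matrix are assumptions introduced only for this example.

```python
import numpy as np


def laplacian_and_normal(A):
    """Return (L, N) with L = K - A and N = K^{-1} A, K = diag(row sums of A)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    if np.any(k == 0):
        # K is not invertible if some node has no links at all.
        raise ValueError("every node needs at least one link")
    L = np.diag(k) - A
    N = A / k[:, None]  # divide row i by k_ii
    return L, N


# Hypothetical undirected graph: a triangle 0-1-2 with a pendant node 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
L, N = laplacian_and_normal(A)
print(N.sum(axis=1))  # each row of N sums to one
print(L.sum(axis=1))  # each row of L sums to zero
```

Row normalization is what makes the constant vector an eigenvector of $N$ with eigenvalue one, and of $L$ with eigenvalue zero.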

The matrix $N$ always has its largest eigenvalue equal to one, associated with a
trivial constant eigenvector, due to row normalization. In a network with an
apparent cluster structure, $N$ also has a certain number $m - 1$ of eigenvalues
close to one, where $m$ is the number of well-defined communities, the remaining
eigenvalues lying a gap away from one. The eigenvectors associated with these first
$m - 1$ nontrivial eigenvalues also have a characteristic structure: the components
corresponding to nodes within the same cluster have very similar values $x_i$, so
that, as long as the partition is sufficiently sharp, the profile of each
eigenvector, sorted by components, is step-like. The number of steps in the profile
corresponds again to the number $m$ of communities. Similar information is encoded
in the non-negative definite Laplacian matrix, where the eigenvalues close to zero
are associated with clusters.
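The eigenvalue gap can be checked numerically on a sharply partitioned toy graph. The sketch below is only illustrative: the graph (three cliques of five nodes joined in a chain by weak links), the helper names and the threshold 0.9 used to decide what counts as "close to one" are all assumptions. For a symmetric weight matrix, $K^{-1}W$ has the same eigenvalues as the symmetric matrix $K^{-1/2} W K^{-1/2}$, which is used here so that a symmetric eigensolver can be applied.

```python
import numpy as np


def three_cliques(size=5, weak=0.01):
    """Hypothetical test graph: three cliques joined in a chain by weak links."""
    n = 3 * size
    W = np.zeros((n, n))
    for c in range(3):
        block = slice(c * size, (c + 1) * size)
        W[block, block] = 1.0
    np.fill_diagonal(W, 0.0)  # no self-links
    for c in range(2):
        i, j = (c + 1) * size - 1, (c + 1) * size
        W[i, j] = W[j, i] = weak  # weak link between consecutive cliques
    return W


def normal_matrix_eigenvalues(W):
    """Eigenvalues of K^{-1} W for symmetric W, in ascending order."""
    k = W.sum(axis=1)
    S = W / np.sqrt(np.outer(k, k))  # K^{-1/2} W K^{-1/2}, same spectrum
    return np.linalg.eigvalsh(S)


W = three_cliques()
eigenvalues = normal_matrix_eigenvalues(W)
threshold = 0.9  # assumed cut-off for "close to one"
count = int(np.sum(eigenvalues > threshold))
print(count)  # trivial eigenvalue plus m - 1 nearby ones
```

Here the count includes the trivial eigenvalue, so it equals the number $m$ of communities; the remaining eigenvalues sit near $-1/4$, far from one. In a network without sharp communities no such gap is visible and a fixed threshold is of little use.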

The study of the eigenvector profiles and the eigenvalues has practical use only
when a clear partition exists, which is rarely the case. Most commonly, the number
of nodes is too large and the separation between the different communities is
rather smooth. Thus communities cannot be simply detected by looking at the first
nontrivial eigenvector. We resolve this issue by combining information from the
first few eigenvectors, and extracting the community structure from correlations
between the same components in different eigenvectors.

To describe the method in detail and understand why 
it works, it is instructive to recast the eigenproblem into 
an optimization problem. With the most general applications in mind, instead of
the adjacency matrix $A$, we focus on the weight matrix $W$, whose elements
$w_{ij}$ are assigned the intensity of the link from $i$ to $j$. We consider
undirected graphs first, and then we pass to the most general directed case.
Consider the following constrained optimization problem: let $z(\mathbf{x})$ be
defined as

$$z(\mathbf{x}) = \frac{1}{2} \sum_{i,j=1}^{S} (x_i - x_j)^2 w_{ij} , \qquad (1)$$

where $x_i$ are values assigned to the nodes, with some constraint on the vector
$\mathbf{x}$, expressed by

$$\sum_{i,j=1}^{S} x_i x_j m_{ij} = 1 , \qquad (2)$$

where $m_{ij}$ are elements of a given symmetric matrix $M$.

FIG. 1: Network employed as an example, with $S = 19$ nodes and random weights
between 1 and 10 assigned to the links. Three clear clusters appear, composed of
nodes 1-6, 7-12 and 13-19.

The stationary points of $z$ over all $\mathbf{x}$ subject to the constraint (2)
are the solutions of

$$(D - W)\mathbf{x} = \mu M \mathbf{x} , \qquad (3)$$

where $D$ is the diagonal matrix $d_{ij} = \delta_{ij} \sum_{k=1}^{S} w_{ik}$, and
$\mu$ is a Lagrange multiplier.

Different choices of the constraint matrix $M$ lead to different eigenvalue
problems: for example, choosing $M = D$ leads to the eigenvalue problem
$D^{-1}W\mathbf{x} = (1 - \mu)\mathbf{x}$, while $M = 1$ leads to
$(D - W)\mathbf{x} = \mu\mathbf{x}$. Thus $M = D$ and $M = 1$ correspond to the
eigenproblems for the (generalized) Normal and Laplacian matrix, respectively.

Thus, solving the eigenproblem is equivalent to minimizing the function $z$ under
the constraint (2), where the $x_i$'s are eigenvector components. The absolute
minimum corresponds to the trivial eigenvector, which is constant. The other
stationary points correspond to eigenvectors where components associated with
well-connected nodes assume similar values.
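The equivalence can be checked numerically. The sketch below, written under the assumption of a symmetric weight matrix with random weights between 1 and 10 (the six-node size, the seed and all variable names are illustrative choices), verifies that $z(\mathbf{x}) = \frac{1}{2}\sum_{i,j}(x_i - x_j)^2 w_{ij}$ coincides with the quadratic form $\mathbf{x}^T (D - W) \mathbf{x}$, and that every eigenvector $\mathbf{v}$ of $D^{-1}W$ with eigenvalue $\lambda$ satisfies $(D - W)\mathbf{v} = (1 - \lambda) D \mathbf{v}$, i.e. is a stationary point for the choice $M = D$.

```python
import numpy as np


def z(x, W):
    """z(x) = 1/2 * sum_ij (x_i - x_j)^2 w_ij."""
    diff = x[:, None] - x[None, :]
    return 0.5 * np.sum(diff ** 2 * W)


rng = np.random.default_rng(0)
n = 6  # illustrative size
upper = np.triu(rng.uniform(1.0, 10.0, size=(n, n)), k=1)
W = upper + upper.T  # symmetric weights, zero diagonal
d = W.sum(axis=1)
D = np.diag(d)

# 1) z(x) equals the quadratic form x^T (D - W) x for any x.
x = rng.normal(size=n)
quadratic_form = x @ (D - W) @ x

# 2) Eigenvectors of D^{-1} W, obtained from the symmetric matrix
#    D^{-1/2} W D^{-1/2}: if S u = lam u, then v = D^{-1/2} u satisfies
#    D^{-1} W v = lam v.
S = W / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)
V = U / np.sqrt(d)[:, None]

# Stationarity for M = D: (D - W) v = (1 - lam) D v for every column v.
residual = np.max(np.abs((D - W) @ V - (D @ V) * (1.0 - lam)))

print(z(x, W), quadratic_form, residual)
print(z(np.ones(n), W))  # the constant vector attains the absolute minimum, zero
```

The last line illustrates why the trivial eigenvector is the absolute minimum: all differences $x_i - x_j$ vanish for a constant vector.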



FIG. 2: Values of the 2nd eigenvector components for matrix $D^{-1}W$ relative to
the graph depicted in Fig. 1.



In order to compute cluster sizes and distribution, methods such as bisection or
edge-betweenness based ones are very poor at detecting the end of the recursive
splitting. Our approach, instead, immediately detects the number of clear clusters
from the eigenvector profile.

As an illustrative example, we show in Fig. 2 the profile of the second eigenvector
of $D^{-1}W$ corresponding to the simple graph shown in Fig. 1 with $S = 19$ nodes,
where random weights between 1 and 10 were assigned to the links. The components of
the eigenvector assume approximately constant values on nodes belonging to the same
community. Thus, the number of communities emerges naturally and is not needed as
input.

However, as mentioned above, when dealing with large networks with no clear
partitioning, the precise value of the eigenvector components is of little use. In
such situations, the typical eigenvector profile is not step-like, but resembles a
continuous curve. Nevertheless, our method can still be applied, and efficiently
detects sets of well-connected nodes. In fact, components corresponding to nodes
belonging to the same community are still strongly correlated, taking, in each
eigenvector, similar values among themselves. Thus, a natural way to identify
communities in an automatic manner is by measuring the correlation

$$r_{ij} = \frac{\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle}{\left[ \left( \langle x_i^2 \rangle - \langle x_i \rangle^2 \right) \left( \langle x_j^2 \rangle - \langle x_j \rangle^2 \right) \right]^{1/2}} , \qquad (4)$$

where the average $\langle \cdot \rangle$ is over the first few nontrivial
eigenvectors. The quantity $r_{ij}$ measures the community closeness between nodes
$i$ and $j$. Though the performance may be improved by averaging over more and more
eigenvectors, with increased computational effort, we find that a small number of
eigenvectors suffices to identify the community to which nodes belong, even in
large networks.
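A minimal sketch of this correlation step is given below. It assumes that the correlation is the usual Pearson coefficient between the components of two nodes, with averages taken over the selected eigenvectors; the function names, the twelve-node toy graph with three planted groups and the choice of four eigenvectors are assumptions made for illustration only.

```python
import numpy as np


def leading_nontrivial_eigenvectors(W, n_vec):
    """First n_vec nontrivial eigenvectors of D^{-1} W (W symmetric).

    Returns an array X with X[i, k] = component of node i in eigenvector k.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))      # symmetric, same spectrum as D^{-1} W
    vals, U = np.linalg.eigh(S)
    V = U / np.sqrt(d)[:, None]          # eigenvectors of D^{-1} W
    order = np.argsort(vals)[::-1]       # largest eigenvalue first
    return V[:, order[1:n_vec + 1]]      # skip the trivial constant eigenvector


def eigenvector_correlation(X):
    """Pearson correlation r_ij between rows of X, averaging over columns."""
    centered = X - X.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / X.shape[1]   # <x_i x_j> - <x_i><x_j>
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


# Hand-checkable case: rows 0 and 1 are proportional, row 2 is reversed.
X_toy = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [3.0, 2.0, 1.0]])
r_toy = eigenvector_correlation(X_toy)
print(r_toy[0, 1], r_toy[0, 2])

# Hypothetical weighted graph: three groups of four nodes, with weights
# between different groups scaled down by an assumed factor of 0.05.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(3), 4)
same_group = groups[:, None] == groups[None, :]
upper = np.triu(rng.uniform(1.0, 10.0, size=(12, 12)), k=1)
W = (upper + upper.T) * np.where(same_group, 1.0, 0.05)

X = leading_nontrivial_eigenvectors(W, n_vec=4)
r = eigenvector_correlation(X)
ranking_for_node_0 = np.argsort(r[0])[::-1]  # nodes sorted by closeness to node 0
print(ranking_for_node_0[:4])
```

Ranking the entries of one row of `r` gives, for a chosen node, the list of nodes that are closest to it in the community sense. How many eigenvectors to average over is a tunable choice; the value 4 used here has no special meaning.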

When dealing with a directed network, links do not correspond to any equivalence
relation. Rather, pointing to common neighbors is a significant relation, as
suggested in the sociology literature, where this quantity measures the so-called
structural equivalence of nodes [18]. Accordingly, in a directed network, clusters
should be composed of nodes pointing to a high number of common neighbors,
regardless of their direct linkage. For directed networks, we thus modify our
method along the lines of the HITS algorithm. The HITS algorithm was proposed on
empirical bases to find the main communities in large directed networks. It assumes
that the largest components (in absolute value) of the eigenvectors of the matrices
$AA^T$ and $A^TA$ correspond to highly clustered nodes belonging to a single
community. Such an algorithm efficiently detects the main communities, even when
these are not sharply defined. However, it becomes computationally heavy when one
is interested in minor communities, which correspond to smaller eigenvalues. As
explained in the undirected case, we tackle this issue by combining information
from the first few eigenvectors, and extracting the community structure from
correlations between the same components in different eigenvectors.

To detect the community structure in a directed network, we therefore replace, in
the previous analysis, the matrix $W$ with the matrix $Y = WW^T$. This corresponds
to replacing the directed network with an undirected weighted network, where nodes
pointing to common neighbors are connected by a link, whose intensity is
proportional to the total sum of the weights of the links pointing from the two
original nodes to the common neighbors. Then, one performs the analysis on the
undirected network as described previously. Thus, the function to minimize in this
case is

$$y(\mathbf{x}) = \frac{1}{2} \sum_{i,j,l=1}^{S} (x_i - x_j)^2 w_{il} w_{jl} . \qquad (5)$$

Defining $Q$ as the diagonal matrix $q_{ij} = \delta_{ij} \sum_{k,l=1}^{S} w_{il} w_{kl}$,
the eigenvalue problem for the analogue of the generalized normal matrix,

$$Q^{-1} Y \mathbf{x} = \lambda \mathbf{x} , \qquad (6)$$

is equivalent to minimizing the function (5) under the constraint
$\sum_{i,j=1}^{S} x_i x_j q_{ij} = 1$.
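The construction for directed networks can be sketched as follows, assuming $Y = WW^T$ and taking $Q$ to be the diagonal matrix of row sums of $Y$, by analogy with $D$ in the undirected case. The four-node directed graph, its weights and the names `co_pointing_matrices`, `y_direct` are assumptions chosen so that the numbers can be checked by hand.

```python
import numpy as np


def co_pointing_matrices(W):
    """Return (Y, Q) with Y = W W^T and Q = diag(row sums of Y).

    W[i, l] is the weight of the directed link i -> l, so that
    Y[i, j] = sum_l W[i, l] * W[j, l] couples nodes pointing to common
    neighbours. Q is singular if some node has no outgoing links.
    """
    W = np.asarray(W, dtype=float)
    Y = W @ W.T
    Q = np.diag(Y.sum(axis=1))
    return Y, Q


# Hypothetical directed graph: 0 -> 2 (weight 2), 1 -> 2 (weight 3),
# 2 -> 3 (weight 1), 3 -> 0 (weight 1).
W = np.array([[0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])
Y, Q = co_pointing_matrices(W)
print(Y[0, 1])  # nodes 0 and 1 share the neighbour 2: 2 * 3 = 6

# y(x) = 1/2 * sum_{i,j,l} (x_i - x_j)^2 w_il w_jl equals x^T (Q - Y) x.
x = np.array([0.5, -1.0, 2.0, 0.0])
diff = x[:, None] - x[None, :]
y_direct = 0.5 * np.einsum("ij,il,jl->", diff ** 2, W, W)
y_quadratic = x @ (Q - Y) @ x
print(y_direct, y_quadratic)

# Spectrum of Q^{-1} Y: the largest eigenvalue is one, as for D^{-1} W.
eigenvalues = np.linalg.eigvals(np.linalg.solve(Q, Y)).real
print(eigenvalues.max())
```

Nodes 0 and 1 are not linked to each other in the original graph, yet they are coupled in $Y$ because both point to node 2; this is the structural-equivalence notion that the directed analysis relies on.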

Tested on simple examples of directed networks, the algorithm associated with the
minimization of $y$ outperforms the one based on the minimization of $z$.

To test this spectral correlation-based community detection method on a real
complex network, we apply the algorithm to data from a previously reported
psychological experiment. Participants volunteering for the research had to respond
quickly by freely associating a word (response) with another word given as input
(stimulus), extracted from a fixed subset. The scientists conducting the research
recorded all the stimuli and the associated responses, along with the number of
occurrences of each association. In the same spirit as past works [23], we
construct a network where words are nodes, and directed links are drawn from each
stimulus to the corresponding responses. The resulting network includes
$S = 10616$ nodes, with an average in-degree equal to about 7. Taking into account
the frequency of responses to a given stimulus, we construct the weighted adjacency
matrix $W$. In this case, passing to the matrix $Y$ means that we expect stimuli
giving rise to the same response to be correlated.
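The construction of $W$ from stimulus-response records can be sketched as below. The three records are made-up toy values, not data from the experiment, and the record layout `(stimulus, response, count)` as well as the function name `association_matrix` are assumptions.

```python
import numpy as np


def association_matrix(records):
    """Build the weighted matrix W from (stimulus, response, count) records.

    W[i, j] accumulates how often word j was given as a response to
    stimulus i, so links are oriented from stimulus to response.
    """
    words = sorted({w for s, r, _ in records for w in (s, r)})
    index = {w: i for i, w in enumerate(words)}
    W = np.zeros((len(words), len(words)))
    for stimulus, response, count in records:
        W[index[stimulus], index[response]] += count
    return words, W


# Made-up toy records, for illustration only.
records = [("piano", "music", 5),
           ("violin", "music", 4),
           ("piano", "keys", 2)]
words, W = association_matrix(records)
Y = W @ W.T  # stimuli sharing a response become linked

i, j = words.index("piano"), words.index("violin")
print(words)
print(Y[i, j])  # 5 * 4 = 20, through the common response "music"
```

In this toy case the two stimuli are not directly linked, but they acquire a link of weight 20 in $Y$ because they share a response. A word that never appears as a stimulus has an empty row in $W$ and therefore a zero row in $Y$, which would have to be handled before inverting $Q$.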

The large-scale properties of semantic and syntactic networks [23] corresponding to
different languages have been examined in past literature, mainly based on
dictionaries and texts: a strong similarity has emerged in such surveys, showing
that statistical features must refer to a common underlying structure rather than
to individual cultures. Interestingly, word graphs studied so far are found to be
complex networks, characterized by the small-world property and by a power-law
degree distribution, independently of the specific definition of the network.

The word association network is an ideal playground to test our algorithm as,
despite the large size of the network, the quality of clustering can be evaluated
by direct inspection of the results. In large databases like this, where a
partition in communities is not defined in a natural manner, there is no definite
answer to what the best partition is. Rather, one is interested in finding groups
of highly correlated nodes, or groups of nodes highly connected to a given one.
Table I shows the words most correlated to three test words. The correlations are
computed by averaging over just 10 eigenvectors of the matrix $Q^{-1}Y$: the
results appear to be quite satisfactory, already with this small number of
eigenvectors.

Besides the performance in finding clusters of correlated words, our results are
suggestive of the criteria according to which the participants in the experiment
have associated words. As we observed, free associations are made by synonymy or
antonymy, syntactic role, and even by analogous sensory perception.

science       1        literature    1        piano        1
scientific    0.994    dictionary    0.994    cello        0.993
chemistry     0.990    editorial     0.990    fiddle       0.992
physics       0.988    synopsis      0.988    viola        0.990
concentrate   0.973    words         0.987    banjo        0.988
thinking      0.973    grammar       0.986    saxophone    0.985
test          0.973    adjective     0.983    director     0.984
lab           0.969    chapter       0.982    violin       0.983
brain         0.965    prose         0.979    clarinet     0.983
equation      0.963    topic         0.976    oboe         0.983
examine       0.962    English       0.975    theater      0.982

TABLE I: The words most correlated to science, literature and piano in the
eigenvectors of $Q^{-1}WW^T$. Values indicate the correlation.

In conclusion, we have introduced a new method to detect communities of highly
connected nodes within a network. The method is based on spectral analysis and
takes into account the presence of weighted links between nodes. Unlike previous
spectral approaches, our method is not based on iterative bisection. We have tested
our algorithm on a real network instance, built upon the records of a psychological
experiment. The algorithm proves to be successful in clustering nodes (in this
case, words) according to reasonable criteria, and provides an automatic way to
extract the sets of nodes most connected to a given one in a set of over $10^4$.
Given its broad range of applicability, such a method suggests a reliable way of
clustering large-scale networks occurring in different fields, including biology,
computer science and sociology.
mon Ferrer i Cancho and Miguel-Angel Muñoz.

We acknowledge partial support from the FET Open 